Day 16 [Python ML, Pandas] Grouping and Sorting

2021 iThome Ironman Contest

DAY 16

AI & Data

Learning Machine Learning with Python series, part 16

13th Ironman Contest

guancioul

Team 人工逗點智慧

2021-09-29 09:07:10


import pandas as pd
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

Groupwise analysis

We can group rows that share the same value and count how many rows fall into each group

reviews.groupby('points').points.count()

points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64
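To make the counting step concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    "points": [80, 80, 81, 99, 81, 80],
    "price": [5.0, 7.0, 6.0, 44.0, 9.0, None],
})

# Count the non-null points values inside each points group.
counts = toy.groupby("points").points.count()

# value_counts() produces the same tallies; sort_index() puts them
# in the same order as the groupby result.
same = toy.points.value_counts().sort_index()

print(counts)
```

On this toy table both `counts` and `same` hold 3, 2 and 1 for the scores 80, 81 and 99.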

We can apply summary functions to the grouped data, for example to get the minimum value within each group

reviews.groupby('points').price.min()

points
80      5.0
81      5.0
       ... 
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64
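The same pattern works with other summary functions. The sketch below uses a made-up `toy` table (an assumption, not the real data) to show `min()` next to `mean()`, and that missing prices are skipped by default.

```python
import pandas as pd

# Hypothetical toy data; one price is missing on purpose.
toy = pd.DataFrame({
    "points": [80, 80, 81, 81, 99],
    "price": [5.0, 7.0, 6.0, None, 44.0],
})

grouped = toy.groupby("points").price

# Cheapest wine in each score group.
cheapest = grouped.min()

# Average price in each score group; the missing value is ignored.
average = grouped.mean()

print(cheapest)
print(average)
```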

We can use apply to run our own function on each group produced by groupby

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
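As a small illustration of the idea, the sketch below picks the first title per winery on a made-up table. The `toy` data is hypothetical, and selecting the `title` column before calling `apply` is a variant of the call above rather than the exact same expression.

```python
import pandas as pd

# Hypothetical toy data: two wineries with two wines each.
toy = pd.DataFrame({
    "winery": ["B", "A", "B", "A"],
    "title": ["B 2010 Red", "A 2012 White", "B 2011 Rosé", "A 2013 Red"],
})

# For each winery, the function receives that winery's titles as a Series
# and returns the first one in the original row order.
first_titles = toy.groupby("winery")["title"].apply(lambda s: s.iloc[0])

print(first_titles)
```

The result is a Series indexed by winery, with the group keys sorted alphabetically.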

We can use a lambda to pick, within each group, the row with the highest points

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
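A possible alternative that avoids `apply` is to ask each group for the row label of its best wine with `idxmax()` and then look those labels up with `loc`. This is only a sketch on a made-up `toy` table. It assumes the row labels are unique, and unlike the `apply` version it keeps the original row labels instead of building a (country, province) index.

```python
import pandas as pd

# Hypothetical toy data with unique default row labels 0..4.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France"],
    "province": ["California", "California", "Oregon", "Bordeaux", "Bordeaux"],
    "title": ["a", "b", "c", "d", "e"],
    "points": [88, 92, 90, 95, 91],
})

# Row label of the highest-scoring wine in each (country, province) group.
best_labels = toy.groupby(["country", "province"]).points.idxmax()

# Look the winning rows up in the original table.
best = toy.loc[best_labels]

print(best)
```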


Another function worth mentioning is agg()

It lets us run several summary functions on each group at once

reviews.groupby(['country']).price.agg([len, min, max])
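`agg()` also accepts function names as strings. The sketch below, on a made-up `toy` table (an assumption for illustration), puts `"size"` next to `"count"`: like `len`, `"size"` counts every row in the group, while `"count"` leaves out missing prices.

```python
import pandas as pd

# Hypothetical toy data; one US price is missing on purpose.
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "France"],
    "price": [10.0, None, 20.0, 30.0, 25.0],
})

# One output column per summary function, one row per country.
stats = toy.groupby("country").price.agg(["size", "count", "min", "max"])

print(stats)
```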


Used well, groupby lets us do a lot of powerful processing on the data

Multi-indexes

So far the indexes we have seen have a single label per row

But because groupby can group by several features at once, the result can have a multi-index

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed


mi = countries_reviewed.index
type(mi)

pandas.core.indexes.multi.MultiIndex
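To get a feel for how a multi-index behaves, here is a small sketch on a made-up `toy` table (hypothetical data, not the wine reviews). It inspects the level names and selects rows with a full key tuple or with only the first level.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France"],
    "province": ["California", "California", "Oregon", "Bordeaux"],
    "description": ["x", "y", "z", "w"],
})

reviewed = toy.groupby(["country", "province"]).description.agg([len])
mi = reviewed.index

# Each level of the MultiIndex is named after a grouping column.
print(list(mi.names))

# A full (country, province) tuple selects a single row.
california = reviewed.loc[("US", "California"), "len"]

# Only the first level selects every province of that country.
us_only = reviewed.loc["US"]
```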

We can use the reset_index() method to turn a MultiIndex back into an ordinary single-level index

countries_reviewed.reset_index()
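The sketch below shows what `reset_index()` does to the shape of the result, using a made-up `toy` table (an assumption for illustration): the index levels become regular columns and the rows get a fresh 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France"],
    "province": ["California", "California", "Oregon", "Bordeaux"],
    "description": ["x", "y", "z", "w"],
})

reviewed = toy.groupby(["country", "province"]).description.agg([len])

# reset_index returns a new DataFrame; `reviewed` keeps its MultiIndex.
flat = reviewed.reset_index()

print(flat.columns.tolist())
print(flat)
```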


Sorting

After resetting the index, we can use sort_values to sort the rows by the len column

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
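Two details are easy to miss here: `sort_values` returns a new DataFrame instead of changing the original, and each row keeps its old label. The sketch below illustrates both on a made-up `counts` table (hypothetical values).

```python
import pandas as pd

# Hypothetical per-province review counts.
counts = pd.DataFrame({
    "country": ["US", "US", "France", "Italy"],
    "province": ["California", "Oregon", "Bordeaux", "Tuscany"],
    "len": [3, 1, 2, 5],
})

# A sorted copy; `counts` itself is left untouched.
by_len = counts.sort_values(by="len")

# Row labels travel with their rows, so the index is no longer 0..n-1.
print(by_len.index.tolist())

# ignore_index=True renumbers the sorted rows from 0.
renumbered = counts.sort_values(by="len", ignore_index=True)
```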


sort_values can also sort in descending order: set the ascending parameter to False

countries_reviewed.sort_values(by='len', ascending=False)
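When only the top few rows matter, `nlargest` is one possible shortcut for a descending sort followed by `head`. The sketch below compares the two on a made-up `counts` table (hypothetical values).

```python
import pandas as pd

# Hypothetical per-province review counts.
counts = pd.DataFrame({
    "province": ["California", "Oregon", "Bordeaux", "Tuscany"],
    "len": [3, 1, 2, 5],
})

# Largest counts first.
desc = counts.sort_values(by="len", ascending=False)

# The two provinces with the most reviews, without sorting everything by hand.
top2 = counts.nlargest(2, "len")

print(desc)
print(top2)
```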


To sort by the index instead, use the sort_index() method

countries_reviewed.sort_index()
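One practical use of `sort_index()` is getting back to the original row order after sorting by values. The sketch below shows this on a made-up `counts` table (hypothetical values).

```python
import pandas as pd

# Hypothetical per-province review counts with default labels 0..3.
counts = pd.DataFrame({
    "province": ["California", "Oregon", "Bordeaux", "Tuscany"],
    "len": [3, 1, 2, 5],
})

# Sorting by value scrambles the row labels.
shuffled = counts.sort_values(by="len")

# Sorting by index puts the rows back in label order.
restored = shuffled.sort_index()

# ascending=False works here too and reverses the label order.
reversed_rows = counts.sort_index(ascending=False)

print(restored)
```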


sort_values can also sort by several columns at once

countries_reviewed.sort_values(by=['country', 'len'])
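With several columns, earlier columns take priority and later ones break ties. `ascending` may also be a list with one flag per column. The sketch below uses a made-up `counts` table (hypothetical values) to show both.

```python
import pandas as pd

# Hypothetical per-province review counts.
counts = pd.DataFrame({
    "country": ["US", "France", "US", "France"],
    "province": ["Oregon", "Bordeaux", "California", "Alsace"],
    "len": [1, 2, 3, 4],
})

# Sort by country first, then by len within each country.
by_both = counts.sort_values(by=["country", "len"])

# Country ascending, but the largest len first within each country.
mixed = counts.sort_values(by=["country", "len"], ascending=[True, False])

print(by_both)
print(mixed)
```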

